arXiv:1505.00296v1 [cs.CV] 2 May 2015 


Object-Scene Convolutional Neural Networks for Event Recognition in Images 


Limin Wang 1,2 Zhe Wang 2 Wenbin Du 2 Yu Qiao 2 

1 Department of Information Engineering, The Chinese University of Hong Kong 
2 Shenzhen key lab of Comp. Vis. & Pat. Rec., Shenzhen Institutes of Advanced Technology, CAS, China 

07wanglimin@gmail.com, buptwangzhe2012@gmail.com, wb.du@siat.ac.cn, yu.qiao@siat.ac.cn 


Abstract 

Event recognition from still images is of great importance for
image understanding. However, compared with event recognition in
videos, there is far less research on event recognition in images.
This paper addresses the issue of event recognition from images and
proposes an effective method with deep neural networks. Specifically,
we design a new architecture, called Object-Scene Convolutional
Neural Network (OS-CNN). This architecture is decomposed into an
object net and a scene net, which extract useful information for
event understanding from the perspective of objects and scene
context, respectively. Meanwhile, we investigate different network
architectures for OS-CNN design, and adapt the deep (AlexNet) and
very-deep (GoogLeNet) networks to the task of event recognition.
Furthermore, we find that the deep and very-deep networks are
complementary to each other. Finally, based on the proposed OS-CNN
and a comparative study of different network architectures, we come
up with a five-stream CNN solution for the cultural event
recognition track of the ChaLearn Looking at People (LAP) challenge
2015. Our method obtains a performance of 85.5% and ranks 1st in
this challenge.

1. Introduction 

Event recognition from still images is one of the challenging
problems in computer vision research. While many efforts have been
devoted to the problem of video-based event and action recognition
[8, 10, 12, 13, 14, 15], there is far less research on image-based
event recognition [6, 17]. Compared with images, videos provide more
useful information for event understanding, such as motion, in
addition to static appearance. Therefore, event recognition from
still images poses more challenges than event recognition from
videos. Meanwhile, the concept of an event is itself extremely
complex, as the characterization of an event is related to many
factors, including objects, human poses, human garments, scene
categories, and other context. Therefore, event recognition is
closely related to other high-level computer vision problems, such
as object recognition [5] and scene recognition [19]. In this paper,
we propose an effective method for the cultural event recognition
track of the ChaLearn Looking at People (LAP) challenge 2015 [1],
which obtains a performance of 85.5% and ranks 1st in this
challenge.

Specifically, we propose a new architecture for event recognition,
called Object-Scene Convolutional Neural Network (OS-CNN), which
extracts the important visual cues of both object and scene for
event understanding. The OS-CNN is decomposed into two separate
nets, namely the object net and the scene net. The object net
aggregates important information for recognizing events from the
perspective of objects, while the scene net performs event
recognition with the help of scene context. The cues from the
contained objects and from the scene context provide complementary
information for event understanding from still images. The
recognition results from the object net and the scene net are
combined by late fusion. Decoupling the object and scene nets also
allows us to exploit the availability of large amounts of annotated
image data by pre-training the object net on the ImageNet challenge
dataset [3] and the scene net on the Places dataset [19].

Meanwhile, there are many well-known and successful network
architectures for CNNs, such as AlexNet [5], ClarifaiNet [18],
GoogLeNet [11], and VGGNet [9]. These architectures have proved to
be effective for object and scene recognition, and have obtained
state-of-the-art performance on the ImageNet and Places datasets
[7, 19]. However, their performance on event recognition and the
complementarity among them have not been fully explored before. In
our proposed OS-CNN, we exploit these successful deep architectures
for event recognition, and further boost the recognition performance
by using an ensemble of them. Finally, based on our OS-CNN and a
comparative study of different network architectures, we come up
with a five-stream CNN solution for the ChaLearn LAP challenge 2015.

The rest of this paper is organized as follows. In Section 2, we
describe the technical details of our OS-CNN. We then provide the
implementation details and experimental results in Section 3.
Finally, we conclude and present future work in Section 4.

Figure 1. The architecture of the Object-Scene Convolutional Neural
Network (OS-CNN) for event recognition. We pre-trained the object
CNN on the ImageNet dataset and the scene CNN on the Places dataset.
Note that the ClarifaiNet architecture is used for the CNN in this
illustration. In practice, other architectures may be chosen for
both the object and scene CNNs, and multiple different architectures
may even be fused.

2. Object-Scene CNNs 

Event understanding is closely related to two other high-level
computer vision problems: object recognition and scene recognition.
Therefore, we utilize two separate components for event recognition.
The object stream, pre-trained on a large object dataset (ImageNet),
carries information about the objects depicted in the image. The
scene stream, pre-trained on a large scene dataset (Places),
captures the scene context of the image. We design our event
recognition architecture accordingly and propose a new network
architecture, called Object-Scene CNN (OS-CNN), as shown in
Figure 1. Each CNN is pre-trained on its own dataset and fine-tuned
for event recognition on the target dataset. We use late fusion to
combine the scores of the two separate CNNs.

2.1. Object nets 

We hope that the object net is able to capture useful object
information to help event recognition. As the object net essentially
deals with object cues, we build it on recent advances in
large-scale image recognition methods [5], and pre-train the network
on a large image classification dataset, such as the ImageNet
dataset [3]. Specifically, we first choose the ClarifaiNet network
architecture [18] and use the pre-trained model in [2]¹. Then, we
fine-tune the model parameters for the task of event recognition on
the training dataset provided by the challenge organizers. The
details of the network architecture can be found in its original
paper [18], and the details of the fine-tuning of network parameters
can be found in Section 3. Next, we describe the scene net, which
exploits scene information for event recognition.

2.2. Scene nets 

The scene net is expected to extract the scene information of the
image to help event recognition. Hence, the scene net is designed
for handling scene context, and we may resort to recent advances on
the problem of scene recognition. The Places dataset [19] is a
recent large dataset that contains 205 scene categories and 2.5
million images. Specifically, we first use the pre-trained model in
[19]², which adopts the well-known AlexNet architecture [5]. Similar
to the object net, we then fine-tune the model parameters on the
training dataset from the cultural event recognition challenge. The
details of the network architecture can be found in its original
paper [5], and the details of the fine-tuning of network parameters
can be found in Section 3.

Based on the above analysis, the recognition of an event is closely
related to the concepts of object and scene. Therefore, we expect
the prediction results of the object and scene nets to be
complementary to each other, and combine them using late fusion as
follows:

$$ s(I) = \alpha_o s_o(I) + \alpha_s s_s(I), \qquad (1) $$

where $I$ is the input image, $s_o(I)$ and $s_s(I)$ are the
prediction scores of the object and scene nets, and $\alpha_o$ and
$\alpha_s$ are the fusion weights of the object and scene nets. In
the current implementation, these two fusion weights are equal to
each other.
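The late fusion of Eq. (1) can be sketched in a few lines of NumPy. This is only an illustration, not the authors' code: the function name `late_fusion`, the toy score vectors, and the concrete value 0.5 for the two weights are assumptions (the text only states that the two weights are equal).

```python
import numpy as np


def late_fusion(object_scores, scene_scores, alpha_o=0.5, alpha_s=0.5):
    """Eq. (1): weighted sum of the per-class scores of the two nets.

    The default weights of 0.5 are an assumption; the paper only says
    that the two fusion weights are equal.
    """
    object_scores = np.asarray(object_scores, dtype=float)
    scene_scores = np.asarray(scene_scores, dtype=float)
    if object_scores.shape != scene_scores.shape:
        raise ValueError("score vectors must have the same shape")
    return alpha_o * object_scores + alpha_s * scene_scores


# Toy scores for three hypothetical event classes.
s_o = np.array([0.7, 0.2, 0.1])  # object net
s_s = np.array([0.3, 0.6, 0.1])  # scene net
fused = late_fusion(s_o, s_s)
predicted_class = int(np.argmax(fused))
```

Because the two weights are equal, the fused ranking of classes is the same as that of the plain sum of the two score vectors.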

2.3. Ensemble of multiple CNNs 

In the past several years, several successful deep CNN architectures
have been designed for the task of object recognition at the
ImageNet Large Scale Visual Recognition Challenge [7]. These
architectures can be roughly classified into two categories: (i)
deep CNNs, including AlexNet [5] and ClarifaiNet [18], and (ii)
very-deep CNNs, including GoogLeNet [11] and VGGNet [9]. The deep
CNN architectures usually contain 5 convolutional layers and 3
fully-connected layers, as shown in Figure 1. The very-deep CNN
architectures resort to extremely deep structures, either by using a
smaller initial filter size or by designing a new Inception module
in a network-in-network manner. Previous studies show that deeper
networks obtain better performance on the task of object
recognition. However, their performance on event recognition
remains unknown.

In this subsection, we exploit these very-deep networks in our
proposed Object-Scene CNN architecture and aim to verify the
superior performance of deeper structures. Specifically, we choose
the GoogLeNet architecture for both the object and scene nets.
GoogLeNet is a 22-layer very-deep network based on a newly designed
module, codenamed Inception. To optimize performance, its
architectural decisions are based on the Hebbian principle and the
intuition of multi-scale processing. The details of the GoogLeNet
architecture can be found in [11]. We use the GoogLeNet model
released on the Caffe webpage³ to initialize the object net. For the
scene CNN, we utilize the pre-trained model released with the
technical report [16]⁴.

We also study the complementarity of convolutional neural networks
with different architectures. We combine the prediction results of
the deep OS-CNNs with those of the very-deep OS-CNNs as follows:

$$ s_x(I) = \beta_d s_x^{d}(I) + \beta_{v-d} s_x^{v-d}(I), \qquad (2) $$

where $x \in \{o, s\}$ denotes the object net or the scene net,
$s_x^{d}(I)$ and $s_x^{v-d}(I)$ are the scores of the deep and
very-deep CNNs, and $\beta_d$ and $\beta_{v-d}$ are their fusion
weights ($\beta_d = 0.3$ and $\beta_{v-d} = 0.6$). Although the
very-deep OS-CNN outperforms the deep OS-CNN, combining them still
further boosts the recognition performance.
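As a rough sketch of how Eq. (2) might feed into Eq. (1), the snippet below first fuses the deep and very-deep scores of each cue with the weights 0.3 and 0.6 given in the text, and then combines the two cues. The dictionary keys, the function names, the toy scores, and the equal cue weights of 0.5 are assumptions made for illustration.

```python
import numpy as np

BETA_DEEP = 0.3       # fusion weight of the deep CNN (from the text)
BETA_VERY_DEEP = 0.6  # fusion weight of the very-deep CNN (from the text)


def fuse_architectures(deep_scores, very_deep_scores):
    """Eq. (2): combine deep and very-deep scores for one cue."""
    deep_scores = np.asarray(deep_scores, dtype=float)
    very_deep_scores = np.asarray(very_deep_scores, dtype=float)
    return BETA_DEEP * deep_scores + BETA_VERY_DEEP * very_deep_scores


def os_cnn_score(scores, alpha_o=0.5, alpha_s=0.5):
    """Apply Eq. (2) per cue, then Eq. (1) across cues.

    `scores` is a hypothetical dict of per-class score vectors; the
    equal cue weights of 0.5 are an assumption.
    """
    s_object = fuse_architectures(scores["object_deep"],
                                  scores["object_very_deep"])
    s_scene = fuse_architectures(scores["scene_deep"],
                                 scores["scene_very_deep"])
    return alpha_o * s_object + alpha_s * s_scene


# Toy scores for three hypothetical event classes.
toy_scores = {
    "object_deep": [0.6, 0.3, 0.1],
    "object_very_deep": [0.8, 0.1, 0.1],
    "scene_deep": [0.2, 0.5, 0.3],
    "scene_very_deep": [0.3, 0.6, 0.1],
}
final_scores = os_cnn_score(toy_scores)
```

Since the two weights in Eq. (2) sum to 0.9 rather than 1, the fused values are not normalized scores; only their ranking across classes matters for the prediction.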

3. Experiments 

In this section, we first describe the cultural event recognition
dataset of the ChaLearn LAP challenge 2015. Then we describe the
implementation details of training OS-CNNs on the event recognition
dataset provided by the challenge organizers. Finally, we present
and analyze the experimental results of the proposed OS-CNNs on this
dataset.

3.1. Datasets and evaluation protocol 

Cultural event recognition is a new task at the ChaLearn LAP
challenge 2015. This task provides an event recognition dataset
composed of images collected from two image search engines (Google
Images and Bing Images). The dataset covers 50 important cultural
events from around the world, and some sample images are shown in
Figure 2. From these images, we see that garments, human poses,
objects, and scene context constitute the possible cues to be
exploited for recognizing the events. The dataset is divided into
three parts: development data (5,875 images), validation data (2,332
images), and evaluation data (3,569 images). During the development
phase, we train our model on the development data and verify its
performance on the validation data. For the final evaluation, we
merge the development and validation data into a single training set
(8,207 images) and re-train our model. Our final submission to the
challenge is obtained with the re-trained model. The principal
quantitative measure is based on the precision/recall curve: the
area under this curve, calculated by numerical integration, is used
as the average precision (AP).

³ https://github.com/BVLC/caffe/wiki/Model-Zoo

⁴ http://vision.princeton.edu/pvt/GoogLeNet/
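The evaluation measure of this subsection can be illustrated with a small NumPy sketch. The exact integration scheme used by the challenge organizers is not given in the text, so the rectangle rule below (summing precision over the recall increments) is an assumption, as are the function names and the toy data.

```python
import numpy as np


def average_precision(scores, labels):
    """Area under the precision/recall curve for one class.

    Uses a rectangle-rule sum over recall increments; the challenge's
    exact integration scheme is not specified, so this is an assumption.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n_positive = labels.sum()
    if n_positive == 0:
        raise ValueError("at least one positive label is required")
    order = np.argsort(-scores, kind="stable")  # highest score first
    hits = labels[order]
    true_positives = np.cumsum(hits)
    precision = true_positives / np.arange(1, len(hits) + 1)
    recall = true_positives / n_positive
    recall_steps = np.diff(np.concatenate(([0.0], recall)))
    return float(np.sum(recall_steps * precision))


def mean_average_precision(score_matrix, label_matrix):
    """Mean of the per-class APs; both inputs are (n_images, n_classes)."""
    score_matrix = np.asarray(score_matrix, dtype=float)
    label_matrix = np.asarray(label_matrix, dtype=int)
    per_class = [average_precision(score_matrix[:, c], label_matrix[:, c])
                 for c in range(score_matrix.shape[1])]
    return float(np.mean(per_class))


# Toy example: four images, two hypothetical event classes.
toy_scores = [[0.9, 0.1], [0.8, 0.9], [0.7, 0.2], [0.6, 0.8]]
toy_labels = [[1, 0], [0, 1], [1, 0], [0, 1]]
toy_map = mean_average_precision(toy_scores, toy_labels)
```

The mean over classes corresponds to the mAP values reported in the experiments, where the average is taken over the 50 event classes.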

3.2. Implementation details 

The training procedure of OS-CNNs is implemented using the Caffe
toolbox [4]. Although there are 8,207 training images in the
cultural event recognition dataset, its size is relatively small
compared with the ImageNet dataset [3]. Therefore, we choose to
pre-train our model on two large datasets: the ImageNet dataset for
the object net and the Places dataset for the scene net, as
described in Section 2. In order to make the deep-learned features
more discriminative for the task of event recognition, we then
fine-tune the network parameters on the cultural event recognition
dataset.

The network weights are learned using mini-batch stochastic gradient
descent with momentum (set to 0.9). At each iteration, a mini-batch
of 256 samples is constructed by random sampling. During the
training phase, all the images are resized to 256 × 256, and a
224 × 224 or 227 × 227 sub-image is randomly cropped from each
image. The crops then undergo random horizontal flipping. The
dropout ratio for the fully-connected layers is set to 0.5. To
reduce over-fitting, we set the learning rate of the hidden layers
to $10^{-2}$ times that of the final layer. The learning rate is
initially set to $10^{-2}$ and decreased according to a fixed
schedule: to $10^{-3}$ after 1.4K iterations and to $10^{-4}$ after
2.8K iterations, with training stopped at 4.2K iterations.
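The learning-rate schedule and the training-time augmentation described above can be sketched as follows. This is a NumPy illustration rather than the Caffe configuration used by the authors. The function names are hypothetical, the flip probability of 0.5 is an assumption, and the last learning-rate value is read as 10^-4 on the assumption that the schedule keeps dividing by ten.

```python
import numpy as np


def learning_rate(iteration):
    """Step schedule: 1e-2, then 1e-3 after 1.4K and 1e-4 after 2.8K iterations.

    The final value of 1e-4 is an assumption (the schedule is taken to
    keep dividing by ten); training stops at 4.2K iterations.
    """
    if iteration < 1400:
        return 1e-2
    if iteration < 2800:
        return 1e-3
    return 1e-4


def random_crop_and_flip(image, crop_size, rng):
    """Randomly crop a square patch and flip it horizontally.

    The flip probability of 0.5 is an assumption; the text only says
    that the flipping is random.
    """
    height, width = image.shape[:2]
    top = rng.integers(0, height - crop_size + 1)   # upper bound is exclusive
    left = rng.integers(0, width - crop_size + 1)
    patch = image[top:top + crop_size, left:left + crop_size]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]
    return patch


rng = np.random.default_rng(0)
image = np.zeros((256, 256, 3), dtype=np.uint8)  # stands in for a resized image
patch = random_crop_and_flip(image, 224, rng)
```

The crop size would be 224 or 227 depending on the input size expected by the chosen architecture.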

During the testing phase, we resort to a multi-view voting method
[5] to classify each image. As in the training procedure, we resize
each testing image to 256 × 256. For each CNN, we obtain 10 inputs
by cropping the four corners and the center of the image and
flipping each crop horizontally. The score of this CNN for the image
is obtained by averaging the scores across these crops. The scores
from multiple object and scene nets are combined using late fusion.
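The multi-view testing procedure can be sketched as below. The network itself is replaced by a hypothetical `predict` callable that maps one crop to a vector of class scores; the function names and the toy input are assumptions made for illustration.

```python
import numpy as np


def ten_crops(image, crop_size):
    """Return the four corner crops and the center crop, each with its flip."""
    height, width = image.shape[:2]
    max_top, max_left = height - crop_size, width - crop_size
    offsets = [(0, 0), (0, max_left), (max_top, 0), (max_top, max_left),
               (max_top // 2, max_left // 2)]
    views = []
    for top, left in offsets:
        crop = image[top:top + crop_size, left:left + crop_size]
        views.append(crop)
        views.append(crop[:, ::-1])  # horizontal flip of the same crop
    return np.stack(views)


def multi_view_score(image, crop_size, predict):
    """Average the class scores that `predict` returns over the 10 views."""
    views = ten_crops(image, crop_size)
    return np.mean([predict(view) for view in views], axis=0)


def toy_predict(view):
    """Hypothetical stand-in for a CNN: two 'class scores' from the mean pixel."""
    mean_value = float(view.mean())
    return np.array([mean_value, 1.0 - mean_value])


test_image = np.full((256, 256, 3), 0.25)  # stands in for a resized test image
views = ten_crops(test_image, 224)
score = multi_view_score(test_image, 224, toy_predict)
```

In practice `predict` would run one forward pass of an object or scene net, and the averaged scores of the individual nets would then be combined by late fusion.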




Figure 2. Samples of cultural event recognition dataset at the ChaLearn LAP challenge 2015. The cultural event recognition dataset has 
50 important cultural events in the world. It includes: Annual Buffalo Roundup (USA), Battle of the Oranges (Italy), Chinese New Year 
(China), Notting Hill Carnival (UK), Obon (Japan) and so on. All the images are collected from the Internet by using Google and Bing 
search engines. These images exhibit large intra-class variations and are very challenging for event recognition. 


3.3. Experimental results 

Effectiveness of OS-CNN. First, we measure the performance of the
separate object and scene nets. Three scenarios are considered: (i)
only using the object net, (ii) only using the scene net, and (iii)
using OS-CNN. For each setting, we use the deep network
architecture: ClarifaiNet for the object net and AlexNet for the
scene net. The results are shown in Figure 4. From these results,
the object net outperforms the scene net for the task of event
recognition (mAP 78.8% vs. 74.8%). It is also clear that fusing the
object and scene nets improves the performance to 81.1%. This result
indicates that the object and scene nets are complementary for event
recognition.

In order to further investigate this complementarity, we visualize
the filters of the first convolutional layer of the object and scene
nets in Figure 3. There are 96 filters in the first convolutional
layer of each net. We observe that both nets learn some common
filters, indicated by the blue box. Meanwhile, some filters,
indicated by red boxes, are only learned by a single net. Therefore,
the object net and scene net may capture common patterns such as
edges, but also extract complementary information with different
filters.

Evaluation of different architectures. Second, we investigate the
performance of CNNs with different architectures and design three
settings: (i) CNN with deep architecture (AlexNet or ClarifaiNet),
(ii) CNN with very-deep architecture (GoogLeNet), and (iii)
combination of both the deep and very-deep architectures. We conduct
this comparative study for both the object and scene nets.

The results of the object net and scene net are shown in Figure 5
and Figure 6, respectively. From these results, it is clear that the
deeper architecture obtains better performance for event
recognition, for both the object net and the scene net, which agrees
with the findings in object recognition. The very-deep architecture
outperforms the deep architecture by about 3%. At the same time, we
observe that the fusion of different architectures further boosts
the recognition performance (about 2% improvement).

Rank   Team            Score
1      MMLAB (Ours)    85.5%
2      UPC-STP         76.7%
3      MIPAL_SNU       73.5%
4      SBLLCS          61.0%
5      MasterBlaster   58.2%
6      Nyx             31.9%

Table 1. Comparison of the performance of our five-stream CNN with
that of the other teams. Our result is significantly better than
the others.

Challenge approach and results. Based on the numerical evaluation
and analysis above, we conclude that (i) the object net is better
than the scene net, (ii) the very-deep architecture outperforms the
deep architecture, and (iii) fusion of multiple CNNs from different
visual cues (object and scene) with different architectures (deep
and very-deep) contributes to performance improvement. Hence, we
introduce another object net with very-deep architecture into our
OS-CNN framework. We pre-train a 19-layer VGGNet on the ImageNet
dataset and fine-tune the network weights on the training dataset of
cultural event recognition. In total, our challenge solution is
composed of five CNN streams, pre-trained on different datasets
(ImageNet or Places) and equipped with different network
architectures. The challenge results are shown in Table 1. Our
method obtains the best performance and outperforms the second place
by nearly 9 percentage points (85.5% vs. 76.7%).

4. Conclusions 

This paper has presented an effective method for cultural event
recognition from still images. We utilize deep CNNs for this task
and propose a new architecture, called Object-Scene Convolutional
Neural Network (OS-CNN). This architecture is decomposed into an
object net and a scene net, which extract useful information for
event understanding from the perspective of objects and scene
content, respectively. Meanwhile, we consider different network
structures for OS-CNN and conduct a comparative study of deep and
very-deep CNNs for event recognition. We show that a deeper
architecture is also helpful in the task of event recognition from
still images, and that the combination of different architectures is
able to boost performance. In practice, based on our proposed OS-CNN
and comparative study, we design a five-stream CNN for the cultural
event recognition track of the ChaLearn LAP challenge 2015. In the
future, we may consider jointly optimizing the object and scene nets
and incorporating more visual cues for event understanding.

Figure 3. The filters learned in the first layer of the object net
and scene net. The blue box indicates similar filters shared by the
two CNNs, and red boxes denote the filters only learned by a single
CNN.

Figure 4. Results of object net (o-cnn), scene net (s-cnn), and
OS-CNN. We plot the average precision (AP) values for the 50 classes
and the last column indicates the mean AP (mAP) over these classes.

Figure 5. Results of object net using different architectures. We
plot the average precision (AP) values for the 50 classes and the
last column indicates the mean AP (mAP) over these classes.

Figure 6. Results of scene net using different architectures. We
plot the average precision (AP) values for the 50 classes and the
last column indicates the mean AP (mAP) over these classes.